Interactively Skimming Recorded Speech
Abstract
Listening to a speech recording is much more difficult than visually scanning a document because of the transient and temporal nature of audio. Audio recordings capture the richness of speech, yet it is difficult to directly browse the stored information. This dissertation investigates techniques for structuring, filtering, and presenting recorded speech, allowing a user to navigate and interactively find information in the audio domain. This research makes it easier and more efficient to listen to recorded speech by using the SpeechSkimmer system.
First, this dissertation describes Hyperspeech, a speech-only hypermedia system that explores issues of speech user interfaces, browsing, and the use of speech as data in an environment without a visual display. The system uses speech recognition input and synthetic speech feedback to aid in navigating through a database of digitally recorded speech. This system illustrates that managing and moving in time are crucial in speech interfaces. Hyperspeech uses manually segmented and structured speech recordings, a technique that is practical only in limited domains.
Second, to overcome the limitations of Hyperspeech while retaining browsing capabilities, a variety of speech analysis and user interface techniques are explored. This research exploits properties of spontaneous speech to automatically select and present salient audio segments in a time-efficient manner. Two speech processing technologies, time compression and adaptive speech detection (to find hesitations and pauses), are reviewed in detail with a focus on techniques applicable to extracting and displaying speech information.
Finally, this dissertation describes SpeechSkimmer, a user interface for interactively skimming speech recordings. SpeechSkimmer uses simple speech processing techniques to allow a user to hear recorded sounds quickly, and at several levels of detail.
User interaction, through a manual input device, provides continuous real-time control of the speed and detail level of the audio presentation. SpeechSkimmer incorporates time-compressed speech, pause removal, automatic emphasis detection, and non-speech audio feedback to reduce the time needed to listen. This dissertation presents a multi-level structural approach to auditory skimming, and user interface techniques for interacting with recorded speech.
Thesis Supervisor: Christopher M. Schmandt, Principal Research Scientist, Program in Media Arts and Sciences
This work was performed in the Media Laboratory at MIT. Support for this research was provided, in part, by Apple Computer, Inc., Interval Research Corporation, and Sun Microsystems, Inc. The ideas expressed herein do not necessarily reflect those of the supporting agencies.
Similar Resources
New Touch Screen Application
An adaptive speech rate control technology for ultra fast listening that is equivalent to skimming is described. Nowadays, listening to audio books on mobile devices is quite common. People read books at various levels of detail from close reading to skimming. Although a similar feature to skimming is required to efficiently obtain information from audio sources, there is no tool equivalent to ...
Interactive Speech Skimming via Time-stretched Audio Replay
Time stretching, sometimes also referred to as time scaling, is a term describing techniques for replaying speech signals faster (i.e., time compressed) or slower (i.e., time expanded) while preserving their characteristics, such as pitch and timbre. One example of such an approach is the SOLA (synchronous overlap and add) algorithm (Roucos & Wilgus, 1985), which is often used to avoid cartoon...
Searching Recorded Speech Based on the Temporal Extent of Topic Labels
Recorded speech poses unusual challenges for the design of interactive end-user search systems. Automatic speech recognition is sufficiently accurate to support the automated components of interactive search systems in some applications, but finding useful recordings among those nominated by the system can be difficult because listening to audio is time consuming and because recognition errors ...
Forward and Backward Speech Skimming with the Elastic Audio Slider
In pursuit of the goal to make recorded speech as easy to skim as printed text, a variety of methods and user interfaces have been suggested in the literature, involving time-compressed audio, speech segmentation and recognition, etc. We propose a new user interface, the elastic audio slider, which makes navigation in speech documents similar to video navigation or text scrolling. The approach ...
Combining non-uniform unit selection with diphone based synthesis
This paper describes the unit selection algorithm of a speech synthesis system, which selects the k-best paths over units from a relational unit database. The algorithm uses words and diphones as basic unit types. It is part of a customisable text-to-speech system designed for generating new prompts using a recorded speech corpus, with the option that the user can interactively optimise the resu...